Trading Freshness for Performance in Distributed Systems
Abstract
Many data management systems face a constant, high-throughput stream of updates. In some cases, these updates are generated externally: a data warehouse must ingest a stream of external events and update its state. In other cases, they are generated by the application itself: large-scale machine learning frameworks maintain a global shared state that stores the parameters of a statistical model, and the application constantly reads and updates these parameters. In many cases, there is a trade-off between the freshness of the data returned by read operations and the efficiency of updating and querying the data. For instance, batching many updates together significantly improves update throughput in most systems, but it introduces a delay between when an update is submitted and when it becomes visible to queries.
In this dissertation, I examine this trade-off in detail. I argue that systems should be designed so that the trade-off can be made by the application, not the data management system. Furthermore, this trade-off should be made at query time, on a per-query basis, not as a global configuration. To demonstrate this, I describe two novel systems.
LazyBase is a data warehouse system originally designed to store metadata extracted from enterprise computer files, for the purposes of enterprise information management. It batches updates and processes them through a pipeline of transformations before applying them to the database, which allows it to achieve very high update throughput. The novel pipeline query mechanism in LazyBase allows applications to select their desired freshness at query time, potentially reading data that is still in the update pipeline and has not yet been applied to the final database.
LazyTables is a distributed machine learning parameter server: a shared storage system for the sparse vectors and matrices that make up the bulk of the data in many machine learning applications. To achieve high performance in the face of network delays and performance jitter, LazyTables makes extensive use of batching and caching, in both the client and server code. The Stale Synchronous Parallel consistency model, conceived for LazyTables, allows clients to specify how far out of sync different threads of execution may be.
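The idea of choosing freshness per query, rather than as a global setting, can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the LazyBase or LazyTables implementation: the class `BatchedStore`, its `batch_size` parameter, and the `fresh` flag are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BatchedStore:
    """Toy key-value store (hypothetical): updates are buffered and applied in batches."""
    batch_size: int = 3
    applied: dict = field(default_factory=dict)   # state after batches are applied
    pending: list = field(default_factory=list)   # buffered updates, oldest first

    def update(self, key, value):
        # Buffer the update; apply the whole batch once the buffer is full.
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Apply buffered updates in submission order, then empty the buffer.
        for key, value in self.pending:
            self.applied[key] = value
        self.pending.clear()

    def read(self, key, fresh=False):
        # fresh=False: read only the applied state (cheap, possibly stale).
        # fresh=True: also scan buffered updates, newest first (more work, up to date).
        if fresh:
            for k, v in reversed(self.pending):
                if k == key:
                    return v
        return self.applied.get(key)


store = BatchedStore(batch_size=3)
store.update("x", 1)
store.update("x", 2)
stale = store.read("x")              # batch not yet applied
fresh = store.read("x", fresh=True)  # sees the buffered update
```

In this sketch the caller decides, on each read, whether to pay the extra cost of scanning unapplied updates. A stale read ignores the buffer entirely, while a fresh read returns the most recently submitted value even before the batch is applied.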
Related resources
Service Differentiation in Real-Time Main Memory Databases
The demand for real-time database services has been increasing recently. Examples include sensor data fusion, stock trading, decision support, web information services, and data-intensive smart spaces. In these systems, it is essential to execute transactions in time using fresh (temporally consistent) data. Due to the high service demand, many transactions may miss their deadlines regardless o...
Adaptive Update Policy for Proactive Management of Deadline Miss Ratio and Data Freshness in Real-Time Database Models
Important applications, such as e-commerce, online stock trading, and traffic control, demand real-time data services. Conventional databases perform poorly in these applications. A database for real-time data services has to support timing constraints and temporal consistency in addition to supporting characteristics of a conventional database system. In other words, it is desirable to execute transactio...
Recent-secure authentication: enforcing revocation in distributed systems
A general method is described for formally specifying and reasoning about distributed systems with any desired degree of immediacy for revoking authentication. To effect revocation, 'authenticating entities' impose freshness constraints on credentials or authenticated statements made by trusted intermediaries. If fresh statements are not presented, then the authentication is questionable. Fre...
FLUTE: A Flexible Real-Time Data Management Architecture for Performance Guarantees
Efficient real-time data management has become increasingly important as real-time applications become more sophisticated and data-intensive. In data-intensive real-time applications, e.g., online stock trading, agile manufacturing, sensor data fusion, and telecommunication network management, it is essential to execute transactions within their deadlines using fresh (temporally consistent) sen...
A Heuristic Approach to Distributed Generation Source Allocation for Electrical Power Distribution Systems
The recent trends in electrical power distribution system operation and management are aimed at improving system conditions in order to render good service to the customer. The reforms in distribution sector have given major scope for employment of distributed generation (DG) resources which will boost the system performance. This paper proposes a heuristic technique for allocation of distribut...
Publication date: 2014